A Support Tool for Tagset Mapping 
Abstract

Many different tagsets are used in exist- 
ing corpora; these tagsets vary accord- 
ing to the objectives of specific projects 
(which may be as far apart as robust pars- 
ing vs. spelling correction). In many sit- 
uations, however, one would like to have 
uniform access to the linguistic informa- 
tion encoded in corpus annotations without 
having to know the classification schemes 
in detail. This paper describes a tool 
which maps unstructured morphosyntactic 
tags to a constraint-based, typed, config- 
urable specification language, a "standard 
tagset". The mapping relies on a manually
written set of mapping rules, which is au- 
tomatically checked for consistency. In cer- 
tain cases, unsharp mappings are unavoid- 
able, and noise, i.e. groups of word forms 
not conforming to the specification, will ap- 
pear in the output of the mapping. The 
system automatically detects such noise 
and informs the user about it. 

The tool has been tested with rules for the
UPenn tagset (Marcus et al. 92) and the
SUSANNE tagset (Garside, Leech, Sampson 87),
in the framework of the EAGLES¹
validation phase for standardised tagsets
for European languages.

1 Motivation 



Tagsets used in existing corpora have usually been 
designed to satisfy the needs of specific projects. A 
tagset used for robust parsing will tend to stress 
distributional properties, whereas a corpus within a 
lexical resource specially designed for human interaction
(which might include a human-oriented dictionary) will
most likely distinguish word classes along
traditional linguistic lines. 

1 LRE project EAGLES, cf. (EAGLES 94)

The tool described in this paper performs tagset
mapping with manually written rules to introduce
a standardised morphosyntactic tagset. Standardisation
of tagsets has been a goal of some contemporary projects
(e.g. (EAGLES 94) and the Text Encoding Initiative
(TEI-AI1W2 91)); at the same time, it has been the object
of much controversy because of the obvious advantages of
tailoring tagsets to project needs. Looking at the problem
from a larger perspective than that of isolated projects, a
uniform tagset has the following advantages:

• Objectivisation and standardisation of 
similar information: Millions of words have 
been analysed in the past, using different anno- 
tation schemes. Especially the manually anal- 
ysed linguistic data is expensive to produce 
and extremely valuable. With a standardised 
tagset, linguistic information from different cor- 
pora of the same language can be reused and 
thus merged into a large data base. Such data 
bases improve the performance of statistical 
methods and are a useful resource for the pro- 
duction of balanced corpora. 

• Shared use of language resources: Corpus 
manipulation tools such as retrieval tools can 
be applied to merged resources in a uniform 
format without much customisation. Users of these
tools will also find it easier to work with a corpus
tagged in a standardised tagset: they have to memorise
only one scheme of tag classes (class names, class
semantics, exceptions) instead of a separate scheme
for each corpus.

• Comparison of annotation schemes: A comparison
of the granularity and degree of similarity of tagsets
can be carried out more objectively, once the mapping
results are available. The validation of the suggestions
of the LRE project EAGLES is an application in this field.

We believe that standards are important for the 
linguistic community, especially from the point of 
view of reusability. 

Of course, there are limits: proposals for standard 
tagsets should be regarded as approaches towards a 
neutral platform between projects and different the- 
ories, rather than as ready-made tagsets that will 
never be changed. It is important indeed that standards
and their support tools be flexible about possible
extensions and improvements.



The more general problem of retagging has been
approached with tools like ICA (Mamrak, O'Connell 92),
a public domain retagging tool which uses SGML as
interlingua.² We also know of current work at Leeds
University on mapping tagsets, though this work is
concerned with the mapping of syntactic structure
encoded in corpora (Atwell et al. 94).



2 A standardised tagset 

When designing the architecture of a standardised 
tagset, we implemented the following constructs as 
they provide considerable advantages compared to 
the traditional flat word labels.

• As the tagset is constraint-based, a flexible 
generalisation is possible over all atomic constraints
and combinations of constraints.³ As a formal grammar⁴
is used to define syntactically well-formed
specifications of word forms, we can regard our
standard tagset as a specification language.

Example: The specification
[pos = v & vtype = aux & pers = 3]
denotes 3rd person auxiliary verbs.

• The tagset is also typed, which adds to the 
naturalness of the specifications of wordforms 
and helps discover semantic errors in specifi- 
cations (inconsistent combinations of features, 
wrong values for features). In our implementation,
we follow the closed-world assumption, which leads to
a coherent interpretation for underspecified and/or
negated descriptions.

Example: [pos = v & (vform = fin | case != gen)]
is a syntactically correct, but ill-typed specification,
as the types v (Verb) and gen (Genitive) are not
type-compatible.

2 Many other retagging tools are available in the
SGML world.

3 Constraints are expressed as attribute-value pairs.

4 We used a grammar for Boolean expressions with
the usual precedences.

• The tagset can be easily modified because its 
manually written definition is compiled into a 
system-internal format.⁵ As the design of a
tagset involves a cycle with feedback phases,
including manual tagging and the writing of
guidelines,⁶ there will be frequent modifications
to the tagset, especially in the initial phase. 
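
The typing check described above can be illustrated with a small sketch. It is not the system's Prolog implementation: the feature table, the tuple encoding of specifications and the names FEATURES, POS_VALUES, pos_of and type_errors are assumptions made for this example, the feature inventory is invented rather than taken from the EAGLES proposal, and the closed-world interpretation of negation is not modelled.

```python
# Hypothetical mini tagset: which POS types a feature is appropriate for,
# and which values it may take (not the EAGLES inventory).
POS_VALUES = {"v", "n", "pron", "adj"}
FEATURES = {
    "vtype": {"applies_to": {"v"}, "values": {"aux", "con", "prim"}},
    "vform": {"applies_to": {"v"}, "values": {"fin", "inf"}},
    "pers": {"applies_to": {"v", "pron"}, "values": {"1", "2", "3"}},
    "case": {"applies_to": {"n", "pron"}, "values": {"nom", "gen"}},
}

# A specification is a nested tuple (assumed encoding):
#   ("eq", feature, value), ("ne", feature, value),
#   ("and", spec, ...), ("or", spec, ...)


def pos_of(expr, ctx):
    """Narrow the set of possible POS types using the pos constraints in expr."""
    kind = expr[0]
    if kind in ("eq", "ne") and expr[1] == "pos":
        return ctx & {expr[2]} if kind == "eq" else ctx - {expr[2]}
    if kind == "and":
        for sub in expr[1:]:
            ctx = pos_of(sub, ctx)
        return ctx
    if kind == "or":
        union = set()
        for sub in expr[1:]:
            union |= pos_of(sub, ctx)
        return union
    return ctx  # constraints on other features do not restrict the POS


def type_errors(expr, ctx=frozenset(POS_VALUES)):
    """Return a list of typing errors; an empty list means well-typed."""
    kind = expr[0]
    if kind in ("and", "or"):
        if kind == "and":
            ctx = pos_of(expr, ctx)  # conjuncts share the narrowed POS
        errors = []
        for sub in expr[1:]:
            errors += type_errors(sub, ctx)
        return errors
    _, feat, val = expr
    if feat == "pos":
        return [] if val in POS_VALUES else [f"unknown pos value {val!r}"]
    decl = FEATURES.get(feat)
    if decl is None:
        return [f"unknown feature {feat!r}"]
    errors = []
    if val not in decl["values"]:
        errors.append(f"wrong value {val!r} for feature {feat!r}")
    if not decl["applies_to"] & ctx:
        errors.append(f"feature {feat!r} is not appropriate for {sorted(ctx)}")
    return errors


# The two example specifications from the text.
GOOD = ("and", ("eq", "pos", "v"), ("eq", "vtype", "aux"), ("eq", "pers", "3"))
BAD = ("and", ("eq", "pos", "v"),
       ("or", ("eq", "vform", "fin"), ("ne", "case", "gen")))
print(type_errors(GOOD))
print(type_errors(BAD))
```

The sketch skips parsing: a real front end would first turn the bracketed notation into such a structure with a Boolean-expression grammar.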



The EAGLES expert group (cf. (Monachini, Calzolari 93))
suggested an inventory of features and values for a
standardised morphosyntactic tagset for European⁷
languages; there are different layers, depending on
language specificity as well as on application
specificity. For the design of a standardised tagset in
a specific language, relevant features and values are to
be chosen from the inventory. Fig. 1 shows a detail of
the tentative English tagset we designed and used for
our tests. The type relations are divided into
hierarchical (POS) features and non-hierarchical
features (MO/SY).

3 Tag mapping: the problems 

Mapping tags of an existing, flat-labeled tagset⁸ or
source annotation scheme to tags of a specification
language (target annotation scheme) is an instance
of the retagging problem. It is straightforward only
in the trivial cases 1:1 (renaming) and n:1. In the
latter case, the physical tagset makes finer
distinctions than the target annotation scheme. This case
introduces no problem for the mapping itself even if
not all information contained in the corpus can be
accessed. Unfortunately, what we usually find in the
mapping business is a mixture of two more
problematic cases:

Case 1:n. The physical tagset cannot support a
distinction intended by the specification language, e.g.
the distinction gender in fig. 2. Therefore, there
is a lack of information: the corpus annotation does
not provide the wanted distinction.

Case n:m. There is an overlap between tag classes,
as illustrated in fig. 3. In the example case, the
source annotation scheme includes special indefinite
pronouns like anybody among the normal common nouns,
whereas some word forms (color) are (wrongly!) tagged
as adjectives in the source annotation scheme but as
common nouns in the target annotation scheme.

5 The system is implemented in Prolog, and the
definition can be spelled out as a structured Prolog fact.

6 The guidelines document is a very important
resource for manual taggers as well as for users of the
corpus data, as it provides the semantics of the tag classes.

7 English, French, Greek, German, Dutch, Portuguese,
Spanish, Italian, Danish.

8 We call such a tagset a physical tagset because its tags
are actually annotated in an existing corpus, in contrast
to the derived tags of the specification language.
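
The four cases can be made concrete with a small sketch that labels groups of related tags and classes. This is a simplification assumed for illustration, not a procedure from the system: correspondences are given as plain (physical tag, target class) pairs, every connected group of tags and classes is counted as one case, and the function name classify_cases, the tags T1 to T3 and all class names are invented.

```python
from collections import defaultdict


def classify_cases(pairs):
    """Label every physical tag with the mapping case of its group.

    pairs: iterable of (physical_tag, target_class) correspondences.
    """
    classes_of = defaultdict(set)  # physical tag -> target classes
    tags_of = defaultdict(set)     # target class -> physical tags
    for tag, cls in pairs:
        classes_of[tag].add(cls)
        tags_of[cls].add(tag)

    cases = {}
    for start in list(classes_of):
        if start in cases:
            continue
        # Collect all tags and classes connected to `start`.
        group_tags, group_classes = set(), set()
        todo = [start]
        while todo:
            tag = todo.pop()
            if tag in group_tags:
                continue
            group_tags.add(tag)
            for cls in classes_of[tag]:
                group_classes.add(cls)
                todo.extend(tags_of[cls])
        n, m = len(group_tags), len(group_classes)
        if n == 1 and m == 1:
            case = "1:1"
        elif m == 1:
            case = "n:1"   # several physical tags, one target class
        elif n == 1:
            case = "1:n"   # one physical tag, several target classes
        else:
            case = "n:m"   # overlapping classes
        for tag in group_tags:
            cases[tag] = case
    return cases


# Invented correspondences; only the VB and NN/JJ groups follow the text.
PAIRS = [
    ("T1", "c1"),
    ("T2", "c2"), ("T3", "c2"),
    ("VB", "inf"), ("VB", "imp"), ("VB", "sub"),
    ("NN", "common_noun"), ("NN", "indef_pronoun"),
    ("JJ", "adjective"), ("JJ", "common_noun"),
]
print(classify_cases(PAIRS))
```

Grouping by connectivity is a coarse view: it reports where problems sit, but says nothing about which word forms fall into the overlap.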

4 Mapping Rules 

We opted for symbolic mapping rules⁹ and designed
two kinds of mapping rules to deal with the discrep- 
ancies indicated above. 



9 We also thought about having a program deduce
mapping rules from a corpus. The automatic learning
of tag correspondences, at least on a semiautomatic
basis, seems possible with standard statistical means
(e.g. HMM-based learning algorithms).

However, the amount of data needed for such an
enterprise (a large training corpus, manually annotated
in both source and target annotation scheme) made us
opt for the symbolic approach.



• Class coverage rules describe a correspondence
of source and target annotation classes.¹⁰
The rule format is as follows: for each physical
tag, the equivalent expression in the specification
language is named.

Example: [pos = 'NN'] => [n & (common & sg | mass)].

The word forms that are annotated with the 
physical tag NN are "common singular nouns or 
mass nouns" in the terms of the specification 
language. 

• The exception lexicon provides a treatment
of the individual discrepancy areas of case n:m,
in order to deal with noise from unsharp mappings.¹¹
Specific lexical items can be reclassified, i.e. their
standard mapping can be overridden. (Notation: the sign
<< stands for "out of".) They can be reclassified in a
different target annotation scheme class instead (sign >>
stands for "into").

Example: The following exception lexicon entry
expresses that the target tag for wordforms
anybody, nothing ... in fig. 3 should not be
the standard reading for NN (common singular
nouns or mass nouns), but should be described
as an indefinite pronoun relating to persons.

[anybody, nothing, something, anything] << [pos =
'NN'] >> [pos=pron & antec=prs & type=indef].

The exception lexicon lookup takes place after the
mapping of the class coverages. For more details,
see (Teufel 94).

Figure 2: Case 1:n (1 class of physical tagset <-> n classes of specification language)
Figure 3: Case n:m (n classes of physical tagset <-> m classes of specification language)

10 These rules are used in cases 1:1, n:1, 1:n and in the
agreement area of case n:m.

11 This solution accounts for lexical exceptions only.
Contextual discrepancies like the decision to tag a
certain wordform like that in one class or in several classes
(demonstrative pronoun or conjunction or relative
pronoun) are not dealt with in this work as this includes a
new disambiguation run (pos-tagging).

5 MTree: Internal representation

After the compilation of the mapping rules, the
system keeps the information in a data structure called
an MTree (mapping tree), see fig. 4, which shows
the verb mappings for UPenn. There is an MTree
for each physical tagset regarded. MTrees contain
a subset of the information contained in the type
graph (see fig. ??), namely only those distinctions
of the original type graph that are distinguishable
in the physical tagset. The new terminals (boxes
with thick lines in fig. 4) in this pruned type graph
correspond to physical tags (encircled tag names).
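
Before turning to the consistency checks, the way compiled rules might be applied to a single tagged token can be sketched as follows. This is an illustration under assumptions, not the system's Prolog data structure: the tables COVERAGE and EXCEPTIONS and the function map_token are hypothetical names, target specifications are kept as plain strings, and word forms are lower-cased before lookup, which the paper does not specify. The lookup order (class coverage first, exception lexicon afterwards) follows the description in section 4.

```python
# Class coverage rules: physical tag -> target specification (as a string).
COVERAGE = {
    "NN": "[n & (common & sg | mass)]",
}

# Exception lexicon: (word form, physical tag) -> overriding specification.
EXCEPTIONS = {
    (word, "NN"): "[pos=pron & antec=prs & type=indef]"
    for word in ("anybody", "nothing", "something", "anything")
}


def map_token(word, tag):
    """Map one tagged token to a target specification."""
    spec = COVERAGE.get(tag)
    if spec is None:
        raise KeyError(f"no class coverage rule for physical tag {tag!r}")
    # The exception lookup happens after the class coverage mapping and
    # overrides the standard reading for the listed word forms only.
    return EXCEPTIONS.get((word.lower(), tag), spec)


print(map_token("house", "NN"))
print(map_token("Anybody", "NN"))
```

Keying the exception lexicon on the pair of word form and physical tag mirrors the << notation: an entry only takes a word "out of" the class it was originally tagged with.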

Within the rule set, the system keeps track of con- 
sistency. Warnings are issued in case of one of the 
following inconsistencies which might occur during 
the construction of an MTree: 

• definition holes: Either target or source an- 
notation schemes are not covered by a mapping 
rule (classes have been forgotten by the person 
writing the mapping rules). 

• non-disjointness of classes: A target annotation
class has several source annotation correspondences.
Although this might be an instance of case n:1, a
warning is issued, because most such cases occur due
to a conceptual error.

• hierarchical inconsistency: Instead of keeping
a clear distinction between terminal classes and
nonterminal classes, an odd mapping assigns terminal
status to ancestors of classes that are terminals
themselves. In fig. 4, the correspondence specified by
the dashed arrow introduces a hierarchical
inconsistency, as it assigns a physical tag (VBN) to a
class (con) that cannot be terminal because its
daughters (past and pres) already are.
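
The three checks can be sketched over a toy hierarchy. The representation is an assumption made for this example (a parent-to-children dictionary for the type graph and a dictionary from physical tags to sets of target classes for the rules), as are the function names descendants and check_rules; the assignments of VBD, VBP, VBZ and MD are invented, and only the VBN-to-con correspondence follows the text.

```python
def descendants(hierarchy, cls):
    """All classes below `cls` in a parent -> children dictionary."""
    found, todo = set(), list(hierarchy.get(cls, []))
    while todo:
        child = todo.pop()
        if child not in found:
            found.add(child)
            todo.extend(hierarchy.get(child, []))
    return found


def check_rules(hierarchy, physical_tags, rules):
    """Return (kind, item) warnings; rules maps tag -> set of target classes."""
    warnings = []
    classes = set(hierarchy) | {c for kids in hierarchy.values() for c in kids}
    leaves = {c for c in classes if not hierarchy.get(c)}
    mapped = set().union(*rules.values())
    covered = set(mapped)
    for cls in mapped:
        covered |= descendants(hierarchy, cls)

    # Definition holes: tags without a rule, leaf classes reached by no rule.
    for tag in sorted(set(physical_tags) - set(rules)):
        warnings.append(("definition hole", tag))
    for leaf in sorted(leaves - covered):
        warnings.append(("definition hole", leaf))

    for cls in sorted(mapped):
        # Several physical tags correspond to the same target class.
        if sum(cls in targets for targets in rules.values()) > 1:
            warnings.append(("non-disjoint classes", cls))
        # A class is used as a terminal although a class below it is too.
        if descendants(hierarchy, cls) & mapped:
            warnings.append(("hierarchical inconsistency", cls))
    return warnings


# Toy verb hierarchy, loosely modelled on the classes named in the text.
HIERARCHY = {"v": ["con", "aux"], "con": ["past", "pres"]}

ODD_RULES = {"VBD": {"past"}, "VBP": {"pres"}, "VBN": {"con"}}
print(check_rules(HIERARCHY, ["VBD", "VBP", "VBN", "MD"], ODD_RULES))

OVERLAP_RULES = {"VBD": {"past"}, "VBP": {"pres"}, "VBZ": {"pres"}, "MD": {"aux"}}
print(check_rules(HIERARCHY, ["VBD", "VBP", "VBZ", "MD"], OVERLAP_RULES))
```

In the first call the missing rule for MD, the uncovered class aux and the assignment of VBN to con are each reported; in the second call only the shared class pres is reported.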

6 System Support 

Figure 4: Detail of the MTree for the UPenn annotation scheme

System support includes

• Compilation of the tagset definition: useful for
tagsets with many non-hierarchical, i.e. combinatory
features (which would have to be multiplied out
manually otherwise).

• Compilation of mapping rules: consistency
checks (cf. section 5).


• Interpretation of specifications: Each specifica- 
tion is syntactically and semantically checked, 
and the corresponding (set of) physical tag(s) is 
computed, using the MTree information. Due 
to 1:n and/or n:m cases (unsharp mapping),
there can be noise (i.e. groups of word forms 
which do not conform to the specification) in 
the output. In these cases, the system antici- 
pates the noise to be expected and informs the 
user. Warnings about noise are essential for a 
correct interpretation of the output. 

Noisy word classes can be deduced from the
MTree: In the MTree given in fig. 4, we can
see that target specification inf (infinitives) will
always induce noise from finite forms, namely
subjunctive and imperative forms, because the
physical class VB does not distinguish between
these groups (case 1:n).
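
How noise can be read off such a structure is sketched below. The dictionary MTREE and the function interpret are hypothetical stand-ins for the system's MTree and its interpretation step: each physical tag is paired with the set of target classes it cannot tell apart, and the two entries are a hand-made excerpt rather than the full UPenn mapping.

```python
# Physical tag -> target classes that the tag does not distinguish.
MTREE = {
    "VB": {"inf", "imp", "sub"},
    "VBD": {"past"},
}


def interpret(wanted, mtree):
    """Return the physical tags to query and the noise each tag brings in.

    wanted: set of target classes selected by a specification.
    """
    tags = sorted(tag for tag, classes in mtree.items() if classes & wanted)
    noise = {
        tag: sorted(mtree[tag] - wanted)
        for tag in tags
        if mtree[tag] - wanted
    }
    return tags, noise


tags, noise = interpret({"inf"}, MTREE)
print(tags)   # tags to search for in the corpus
print(noise)  # classes that will appear although they were not asked for
```

A query for infinitives has to use the tag VB, so imperative and subjunctive forms are reported as expected noise; a query for past forms uses VBD alone and produces no warning.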



7 Results and Outlook 

For test purposes, we wrote mapping rules for the 
UPenn and SUSANNE tagsets. The number of coverage
rules is equal to the number of physical tags. Rules
are easy to formulate, once users have got used to the
class semantics of the standard tagset. The information
needed comes from the tagging guidelines, if the
source annotation scheme comes with a comprehensive
description of the intended class semantics,¹² or
from corpus queries otherwise, which is more
time-consuming.

We wrote exemplary exception lexicon entries for 
auxiliary verbs and some for noun exceptions, but 
more work can be put into the exception lexicon to 
improve the accuracy in the lexically determinable 
cases of discrepancies. 

Apart from being used for the validation of the 
EAGLES standard for English and German, the 
tool has been integrated into a corpus query system 
(Christ 94, Schulze 94) to allow for "more abstract" 
and corpus independent queries. A typical query 
(content verbs in infinitive or primary auxiliaries in 
past tense) to a specific corpus (here: UPenn) looks
like this:

Query> [(vtype=con & vform=inf) |
        (vtype=prim & tense=past)].

%% warning: Noise from [con & fin & imp]
%%          and from [con & fin & sub]
%%          (Due to tag "VB")!

[((pos = "VB" & word != "be|do|have") |
  (pos = "VBD" & word = "was|were|had|did") |
  (pos = "VBN" & word = "been|had|done"))]

We get the information that the system will query
for tags VB, VBD, VBN (with lexical constraints) in
the UPenn corpus; however, we must expect to find
finite content verbs (namely imperative and
subjunctive forms) in our output (1:n case).

12 (Santorini 91) provides tagging guidelines for the
UPenn corpus, (Garside, Leech, Sampson 87) for the
SUSANNE corpus.

It would be particularly interesting to explore
ways to use a machine-readable dictionary (MRD) to
build an exception lexicon automatically, which would
be especially useful for closed word classes.

Other interesting cases are multi-word tags and
discrepancies with respect to the assignment of word
boundaries (tokenising). Compare the following
cases (UPenn tokenising and tagging):

• Peter/NP 's/POS house

• he/PP 's/VBZ not at home 

In our opinion, Peter's should be regarded as one 
nominal item (with genitive as value for the case 
attribute), whereas he and 's should be kept as two 
words. We are thinking about designing a rule
construct to express this kind of word bundling with
conditional features.